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Abstract 

Composite convex optimization models arise in several applications, and are especially prevalent in inverse 
problems with a sparsity inducing norm and in general convex optimization with simple constraints. The most 
widely used algorithms for convex composite models are accelerated hrst order methods, however they can 
take a large number of iterations to compute an acceptable solution for large-scale problems. In this paper 
we propose to speed up hrst order methods by taking advantage of the structure present in many applications 
and in image processing in particular. Our method is based on multi-level optimization methods and exploits 
the fact that many applications that give rise to large scale models can be modelled using varying degrees of 
hdelity. We use Nesterov’s acceleration techniques together with the multi-level approach to achieve an OiXjyJT) 
convergence rate, where e denotes the desired accuracy. The proposed method has a better convergence rate 
than any other existing multi-level method for convex problems, and in addition has the same rate as accelerated 
methods, which is known to be optimal for first-order methods. Moreover, as our numerical experiments show, 
on large-scale face recognition problems our algorithm is several times faster than the state of the art. 


1 Introduction 

Composite convex optimization models consist of the optimization of a convex function with Lipschitz continuous 
gradient plus a non-smooth term. They arise in several applications, and are especially prevalent in inverse problems 
with a sparsity inducing norm and in general convex optimization with simple constraints (e.g. box or linear 
constraints). Applications include signal/image reconstruction from few linear measurements (also referred to as 
compressive or compressed sensing) [Ml [TTl [181 HI] i image super-resolution [52] , data classification [48l |49] , feature 
selection [43] , sparse matrix decomposition m, trace norm matrix completion m, image denoising and deblurring 
[23j ■ to name a few. 

Given the large number of applications it is not surprising that several specialized algorithms have been proposed 
for convex composite models. Second order methods outperform all other methods in terms of the number of 
iterations required. Interior point methods m. Newton and truncated Newton methods |29j . primal and dual 
Augmented Lagrangian methods m have all been proposed. However, unless the underlying application posses 
some specific sparsity pattern second order methods are too expensive to use in applications that arise in practice. 
As a result the most widely used methods are first order methods. Following the development of accelerated first 
order methods for the smooth case m and the adaptation of accelerated first order methods to the non-smooth 
case ([51 [55] ~) there has been a large amount of research in this area. State of the art methods use only first order 
information and as a result (even when accelerated) take a large number of iterations to compute an acceptable 
solution for large-scale problems. Applications in image processing can have millions of variables and therefore new 
methods that take advantage of problem structure are needed. 

We propose to speed up first order methods by taking advantage of the structure present in many applications 
and in image processing in particular. The proposed methodology uses the fact that many applications that give rise 
to large scale models can be modelled using varying degrees of fidelity. The typical applications of convex composite 
optimization include dictionaries for learning problems, images for computer vision applications, or discretization 
of infinite dimensional problems appearing in image processing [5]. First order methods use a quadratic model with 
first order information and the Lipschitz constant to construct a search direction. We propose to use the solution 
of a low dimensional (not necessarily quadratic) model to compute a search direction. The low dimensional model 
is not just a low fidelity model of the original model but it is constructed in a way so that is consistent with the 
high fidelity model. 
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The method we propose is based on multi-level optimization methods. A multi-level method for solving convex 
infinite-dimensional optimization problems was introduced in m and later further developed by Nash in [33] , 
Although [33] is one of the pioneering works that uses a multi-level algorithm in an optimization context, it is 
rather abstract and only gives a general idea of how an optimization algorithm can be used in a multi-level way. 
The author proves that the algorithm converges under certain mild conditions. Extending the key ideas of Nash’s 
multi-level method in [46) . Wen and Goldfarb used it in combination with a line search algorithm for solving 
discretized versions of unconstrained infinite-dimensional smooth optimization problems. The main idea is to use 
the solution obtained from a coarse level for calculating the search direction in the fine level. On each iteration 
the algorithm uses either the negative gradient direction on the current level or a direction obtained from coarser 
level models. They prove that either search direction is a descent direction using Wolfe-Armijo and Goldstein-like 
stopping rules in their backtracking line search procedure. Later a multi-level algorithm using the trust region 
framework was proposed in [2^. In all those works the objective function is assumed to be smooth, whereas the 
problems addressed in this paper are not. We also note that multi-level optimization methods have been applied 
to PDE optimization for many years. A review of the latter literature appeared in m- 

The first multi-level optimization algorithm covering the non-smooth convex composite case was introduced 
in m- It is a multi-level version of the well-known Iterative Shrinkage-Thresholding Algorithm (ISTA) ([^, [19) . 
pn]l. called MISTA. In [5T] the authors prove R-linear convergence rate of the algorithm and demonstrate in several 
image deblurring examples that MISTA can be up to 3-4 times faster than state of the art methods. However, 
MISTA’s R-linear convergence rate is not optimal for first order methods and thus it has the potential for even better 
performance. Motivated by this possible improvement, in this paper we apply Nesterov’s acceleration techniques 
on MISTA, and show that it is possible to achieve 0(1/^/e) convergence rate, where e denotes the desired accuracy. 
To the best of our knowledge this is the first multi-level method that has an optimal rate. In addition to the proof 
that our method is optimal, we also show in numerical experiments that, for large-scale problems it can be up to 
10 times faster than the current state of the art. 

One very popular recent approach for handling very large scale problems are stochastic coordinate descent meth¬ 
ods [211 ED. However, while being very effective for ^i-regularized least squares problems in general, stochastic 
gradient methods are less applicable for problems with highly correlated data, such as the face recognition prob¬ 
lem and other pattern recognition problems, as well as in dense error correction problems with highly correlated 
dictionaries [47]. We compare our method both with accelerated gradient methods (FISTA) as well as stochastic 
block-coordinate methods (APGG [21] )■ 

Our convergence proof is based on the proof of Nesterov’s method given in [D , where the authors showed that 
optimal gradient methods can be viewed as a linear combination of primal gradient and mirror descent steps. We 
show that for some iterations (especially far away from the solution) it is beneficial to replace gradient steps with 
coarse correction steps. The coarse correction steps are computed by iterating on a carefully designed coarse model 
and a different step-size strategy. Since our algorithm combines concepts from multilevel optimization, gradient 
and mirror descent steps we call it Multilevel Accelerated Gradient Mirror Descent Algorithm (MAGMA). The 
proposed method converges in function value with a faster rate than any other existing multi-level method for 
convex problems, and in addition has the same rate as accelerated methods which is known to be optimal for 
first order methods [34) . However, given that we use the additional structure of problems that appear in imaging 
applications in practice our algorithm is several times faster than the state of the art. 

The rest of this paper is structured as follows: in Section [2] we give mathematical definitions and describe 
the algorithms most related to MAGMA. Section E) is devoted to presenting a multi-level model as well as our 
multi-level accelerated algorithm with its convergence proof. Finally, in Section E) we present the numerical results 
from experiments on large-scale face recognition problems. 


2 Problem Statement and Related Methods 

In this section we give a precise description of the problem class MAGMA can be applied to and discuss the main 
methods related to our algorithm. MAGMA uses gradient and mirror descent steps, but for some iterations it 
uses a smooth coarse model to compute the search direction. So in order to understand the main components 
of the proposed algorithm we briefly review gradient descent, mirror descent and smoothing techniques in convex 
optimization. We also introduce our main assumptions and some notation that will be used later on. 
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2.1 Convex Composite Optimization 

Let / : R" —>■ (— c», cw] be a convex, continuously differentiable function with an Ly-Lipschitz continuous gradient: 

||V/(x) - V/(y)|U <L/||x-y||, 

where || • ||* is the dual norm of || • || and g : R" —>■ (— 00 , 00 ] be a closed, proper, locally Lipschitz continuous convex 
function, but not necessarily differentiable. Assuming that the minimizer of the following optimization problem: 


+ 5(x)}, (1) 

is attained, we are concerned with solving it. 

An important special case of (H} is when /(x) = |||Ax — b ||2 and g{x) = A||x||i, where A G is a full rank 

matrix with m < n, b G R™ is a vector and A > 0 is a parameter. Then m becomes 

min ^||Ax-b||^ + A||x||i, (Pi) 

The problem (El arises from solving underdetermined linear system of equations Ax = b. This system has more 
unknowns than equations and thus has infinitely many solutions. A common way of avoiding that obstacle and 
narrowing the set of solutions to a well-defined one is regularization via sparsity inducing norms |22j . In general, 
the problem of finding the sparsest decomposition of a matrix A with regard to data sample b can be written as 
the following non-convex minimization problem: 


min ||x||o subject to Ax = b, 

xeR" 


(Po) 


where || • ||o denotes the £o-pseudo-norm, i.e. counts the number of non-zero elements of x. Thus the aim is to recover 
the sparsest x such that Ax = b. However, solving the above non-convex problem is NP-hard, even difficult to 
approximate [2]. Moreover, less sparse, but more stable to noise solutions are often more preferred than the sparsest 
one. These issues can be addressed by minimizing the more forgiving ^i-norm instead of the ^o"Pseudo-norm: 


min llxlli subject to Ax = b. 

xeR" 


(BP) 


It can be shown that problem (IBPI) (often referred to as Basis Pursuit (BP)) is convex, its solutions are gathered 
in a bounded convex set and at least one of them has at most m non-zeros [22j . In order to handle random noise 
in real world applications, it is often beneficial to allow some constraint violation. Allowing a certain error e > 0 
to appear in the reconstruction we obtain the so-called Basis Pursuit Denoising (BPD) problem: 


mm ||x||i subject to ||Ax — b||| < e, 


(BPD) 


or using the Lagrangian dual we can equivalently rewrite it as an unconstrained, but non-smooth problem (El)- 
Note that problem ( |PiD is equivalent to the well known LASSO regression [H]. Often BP and BPD (in both 
constrained and unconstrained forms) are referred to as £i-min problems. 

A relatively new, but very popular example of the BPD problem is the so-called dense error correction (DEC) 
[47j . which studies the problem of recovering a sparse signal x from highly corrupted linear measurements A. It 
was shown that if A has highly correlated columns, with an overwhelming probability the solution of (jBPp is also a 
solution for (E M- One example of DEC is the face recognition (FR) problem, where A is a dictionary of facial 
images, with each column being one image, b is a new incoming image and the non-zero elements of the sparse 
solution X identify the person in the dictionary whose image b is. 

Throughout this section we will continue referring to the FR problem as a motivating example to explain certain 
details of the theory. In the DEC setting, A among other structural assumptions, is also assumed to have highly 
correlated columns; clearly, facial images are highly correlated. This assumption on one hand means that in some 
sense A contains extra information, but on the other hand, it is essential for the correct recovery. We propose to 
exploit the high correlation of the columns of A by creating coarse models that have significantly less columns and 
thus can be solved much faster by a multi-level algorithm. 

We use the index H to indicate the coarse level of a multi-level method: xh G R"® is the coarse level variable, 
fni') '■ R"" —>■ (— 00 , 00 ] and gni') ■ R”^ —>■ (— 00 , 00 ] are the corresponding coarse surrogates of /(•) and g{-), 
respectively. In theory, these functions only need to satisfy a few very general assumptions (Assumption [T]), but 
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of course, in practice they should reflect the structure of the problem, so that the coarse model is a meaningful 
smaller sized representation of the original problem. In the face recognition example creating coarse models for a 
dictionary of facial images is fairly straightforward: we take linear combinations of the columns of A. 

In our proposed method we do not use the direct prolongation of the coarse level solution to obtain a solution 
for the fine level problem. Instead, we use these coarse solutions to construct search directions for the fine level 
iterations. Therefore, in order to ensure the convergence of the algorithm it is essential to ensure that the fine 
and coarse level problems are coherent in terms of their gradients [331146j . However, in our setting the objective 
function is not smooth. To overcome this in a consistent way, we construct smooth coarse models, as well as use 
the smoothed version of the fine problem to find appropriate step-sizefl Therefore, we make the following natural 
assumptions on the coarse model. 

Assumption 1. For each coarse model constructed from a fine level problem m it holds that 

1. The coarse level problem is solvable, i.e. bounded from below and the minimum is attained, and 

2. fni^) o,nd g_fr(x) are both continuous, closed, convex and differentiable with Lipschitz continuous gradients. 

On the fine level, we say that iteration fc is a coarse step, if the search direction is obtained by coarse correction. 
To denote the j-th iteration of the coarse model, we use ^h.j- 


2.2 Gradient Descent Methods 


Numerous methods have been proposed to solve o when g has a simple structure. By simple structure we mean 
that its proximal mapping, defined as 

prox^(x) = argmin j^lly- x|p + (V/(x),y - x) + 5 (y)l, 

yGR" I ^ J 


for some L G R, is given in a closed form for any /. Correspondingly, denote 

' L. 


Progr (x) = — min 

^ yGR" 


||y-x|| + (V/(x),y-x)+g(y)-g(x) 


We often use the prox and Prog operators with the Lipschitz constant L/ of V/. In that case we simplify the 
notation to prox(x) = prox^^^ (x) and Prog(x) = Progj^^ (x) d. 

In case of the FR problem (and LASSO, in general) prox(x — -^V/(x)) becomes the shrinkage-thresholding 
operator TKx)^ = (|xi| — I/L)_|_sgn(xi) and the iterative scheme is given by. 


Xfc+l — T\/l ^ 


Xfc - — A^(Axfe - b) 


Then problem ( |Pi[ ) can be solved by the well-known Iterative Shrinkage Thresholding Algorithm (ISTA) and its 
generalizations, see, e.g. 0 , m, |20) . [25]. When the stepsize is fixed to the Lipschitz constant of the problem, 
ISTA reduces to the following simple iterative scheme. 


Xfe+i = prox(xfc). 

If g{-) = 0, the problem is smooth, prox(x) = x^ — -j^V/(xfe) and ISTA becomes the well-known steepest descent 
algorithm with 

Xfc+i = Xfc - V/(xfe). 

For composite optimization problems it is common to use the gradient mapping as an optimality measure [38] : 

D(x) = X — prox(x). 


It is a generalization of the gradient for the composite setting and preserves its most important properties used for 
the convergence analysis. 

We will make use of the following fundamental Lemma in our convergence analysis. The proof can be found, 
for instance, in [38] . 

^We give more details about smoothing non-smooth functions in general, and || • ||i in particular, at the end of this section. 

^ These definitions without loss of generality also cover the case when the problem is constrained and smooth. Also see m for the 
original definitions for the unconstrained smooth case. 
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Lemma 1 (Gradient Descent Guarantee). For any x G M", 

F(prox(x)) < F(x) — Prog(x). 

Using Lemmalll it can be shown that ISTA computes an e-optimal solution in terms of function values in 0{l/e) 
iterations (see [5] and references therein). 

2.3 Mirror Descent Methods 

Mirror Descent methods solve the general problem of minimizing a proper convex and locally Lipschitz continuous 
function F over a simple convex set Q by constructing lower bounds to the objective function (see for instance [34) . 
[39)1. Gentral to the definition of mirror descent is the concept of a distance generating function. 

Definition 1. A function w(x) ; Q —^ R is called a Distance Generating Function (DFG), if lo is 1-strongly convex 
with respect to a norm || • ||, that is 

uj{z) > u;(x) + {Vuj(x), z — x) + -||x — z|p Vx, z G Q. 

Accordingly, the Bregman divergence (or prox-term) is given as 

t4(z) = w(z) — (Vw(x), z — x) — uj{x) Vx, z G Q. 

The property of DFG ensures that i4(x) = 0 and t4(z) > ~ > 0- 

The main step of mirror descent algorithm is given by, 

Xk+i = Mirrxj,(V/(xfc),a), where Mirrx(^,a) = argmin{Ux(z) + (a^,z - x) + ag{z)}. 

z6R" 

Here again, we assume that Mirr can be evaluated in closed form. Note that choosing uj(z) = l/2||z||| as the DFG, 
which is strongly convex w.r.t. to the £ 2 "iiorm over any convex set and correspondingly 14 ;(z) = ^||x— z|||, the 
mirror step becomes exactly the proximal step of ISTA. However, regardless of this similarity, the gradient and 
mirror descent algorithms are conceptually different and use different techniques to prove convergence. The core 
lemma for mirror descent convergence is the so called Mirror Descent Guarantee. For the unconstraine41 composite 
setting it is given below, a proof can be found in [38] . 

Lemma 2 (Mirror Descent Guarantee). Let Xk+i = Mirrx^ (V/(xfc), a), then Vu G R” 

a(F(xk) - F{u)) < a(V/(xfc), Xfc - u) + a{g{xk) - ^(u)) 

< a^Lf Prog(xfe) + 14Ju) - Ux.+^u). 

Using the Mirror Descent Guarantee, it can be shown that the Mirror Descent algorithm converges to the 
minimizer x* in 0 {Lf/ e), where e is the convergence precision |9]. 

2.4 Accelerated First Order Methods 

A faster method for minimizing smooth convex functions with asymptotically tight convergence rate 

[34] was proposed by Nesterov in [37] . This method and its variants were later extended to solve the more general 
problem (P) (e.g. 0, [3S], P), see [15] for full review and analysis of accelerated first order methods. The main idea 
behind the accelerated methods is to update the iterate based on a linear combination of previous iterates rather 
than only using the current one as in gradient or mirror descent algorithms. The first and most popular method 
that achieved an optimal convergence rate for composite problems is FISTA 0 . At each iteration it performs one 

^Simple constraints can be included by defining g as an indicator function. 
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proximal operation, then uses a linear combination of its result and the previous iterator for the next iteration. 
Data: FISTA(/(-), g{-), xq) 

Choose e > 0. 

Set yi = Xq and ti = 1. 
for A: = 0,1, 2,... do 
Set Xfc+i = pro x(xfc) . 

Set 4+1 = 

Set yfc+i = Xfc + (xfc - Xfc_i). 


if ||D(xfc)||* < e then 
I return x^. 
end 
end 


Algorithm 1: FISTA 

Alternatively, Nesterov’s accelerated scheme can be modified to use two proximal operations at each iteration 
and linearly combine their results as the current iteration point |35j . In a recent work this approach was reinterpreted 
as a combination of gradient descent and mirror descent algorithms [1]. The algorithm is given next. 

Data: AGM(/(.), ff(-), xq) 

Choose e > 0. 

Choose DCF w and set I4(y) = w(y) — (Va;(x), y — x) — a;(x). 

Set yo = Zo = Xq. 
for A: = 0,1, 2,... do 

Set Ofe+i = ^ and 4 = 

Set Xfc = 4zfc + (1 - 4)yfe- 
if ||D(xfc)||* < e then 
I return Xfc. 
end 

Set yfc+i = prox(xfc). 

Set Zfc+i = MirrzjV/(xfc),afc+i). 

end 


Algorithm 2: ACM 

The convergence of Algorithm [2] relies on the following lemmas. 


Lemma 3. If tk = at+iLf satisfies that for every u G R", 


afc+i(V/(xfc), Zfc - u) + Q!fc+i(g(zfc) - g{u)) 

< al+iLf Prog(xfc) + 14, (u) - I4,+i (u) 

< afe+i^/(^(xfc) - ^(yfe+i)) + Vzfc(u) - I4,+i(u). 

Lemma 4 (Coupling). For any u G R" 

afe+iL/F(yfc.+i) - - ak+i)F{yk) + 

Theorem 1 (Convergence). Algorithm\^ ensures that 

Fiyr) - F(x*) < 

where 0 is any upper bound on 14^ (x*). 

Remark 1. In the analysis of the algorithm is done for smooth constrained problems, however it can be set to 
work for the unconstrained composite setting as well. 

Remark 2. Whenever Lf is not known or is expensive to calculate, we can use backtracking line search with the 
following stopping condition 

F{yk+i) < F{j<ik) - Prog+(xfc), (2) 

where L > 0 is the smallest number for which the condition (0) holds. The convergence analysis of the algorithm 
can be extended to cover this case in a straightforward way (see m)- 


(Pz,+i(u) - I4fc(u)) < afc.+iF(u). 

40Ly 
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2.5 Smoothing and First Order Methods 

A more general approach of minimizing non-smooth problems is to replace the original problem by a sequence of 
smooth problems. The smooth problems can be solved more efficiently than by using subgradient type methods 
on the original problem. In [35] Nesterov proposed a first order smoothing method, where the objective function 
is of special max-type. Then Beck and Teboulle extended this method to a more general framework for the class 
of so-called smoothable functions [^. Both methods are proven to converge to an e-optimal solution in 0{l/e) 
iterations. 

Definition 2. Let (;:£—>■ (— 00 , 00 ] he a closed and proper convex function and let x C domg be a 
set. The function g is called ’’(a, fS, K)-smoothable” over X, if there exist Pi, P 2 satisfying Pi + P 2 
that for every g, > 0 there exists a continuously differentiable convex function ( 7 ^ : E —>■ (— 00 , 00 ) 
following hold: 

1. g{x.) - Pig < 5 ^(x) < ^(x) -|- P 2 g for every x G A. 

2. The function has a Lipschitz gradient over X with a Lipschitz constant such that there exists K > 0, a > 0 
such that 

||Vg^(x) - Vg^(y)||, < (^K + ^ ||x- yjj for every x,y G A. 


closed convex 
= /3 > 0 such 
such that the 


We often use = K + ^. 

It was shown in that with an appropriate choice of the smoothing parameter g a solution obtained by 
smoothing the original non-smooth problem and solving it by a method with 0{l/^/e) convergence rate finds an 
e-optimal solution in 0[l/e) iterations. When the proximal step computation is tractable, accelerated first order 
methods are superior both in theory and in practice. However, for the purposes of establishing a step-size strategy 
for MAGMA it is convenient to work with smooth problems. We combine the theoretical superiority of non-smooth 
methods with the rich mathematical structure of smooth models to derive a step size strategy that guarantees 
convergence. MAGMA does not use smoothing in order to solve the original problem. Instead, it uses a smoothed 
version of the original model to compute a step size when the search direction is given by the coarse model (coarse 
correction step). 

For the FR problem (and ( |PiD in general), where g(x) = A||x||i, it can be shown [B] that 

n 

5m(x) = (3) 

is a ^-smooth approximation of g(x) = A||x||i with parameters (A,An, 0). 


3 Multi-level Accelerated Proximal Algorithm 

In this section we formally present our proposed algorithm within the multi-level optimization setting, together 
with the proof of its convergence with OlfVjsfe) rate, where e is the convergence precision. 

3.1 Model Construction 

First we present the notation and construct a mathematical model, which will later be used in our algorithm 
for solving O- A well-defined and converging multi-level algorithm requires appropriate information transfer 
mechanisms in between levels and a well-defined and coherent coarse model. These aspects of our model are 
presented next. 

3.1.1 Information Transfer 

In order to transfer information between levels, we use linear restriction and prolongation operators R : and 

P : R"hX" respectively, where uh < n is the size of the coarse level variable. The restriction operator R transfers 
the current fine level point to the coarse level and the prolongation operator P constructs a search direction for the 
fine level from the coarse level solution. The techniques we use are standard (see [13]) so we keep this section brief. 
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There are many ways to design R and P operators, but they should satisfy the following condition to guarantee 
the convergence of the algorithm, 

crP = RT, 

with some scalar cr > 0. Without loss of generality we assume cr = 1, which is a standard assumption in the 
multi-level literature m, 1331- 

In our applications, when the problem is given as (El), we use a simple interpolation operator given in the form 
of (l25)l as the restriction operator, which essentially constructs linear combinations of the columns of dictionary A. 
We give more details on using R for our particular problem in Section 01 

3.1.2 Coarse Model 

A key property of the coarse model is that at its initial point :x.h,o the optimality conditions of the two models 
match. This is achieved by adding a linear term to the coarse objective function: 

Fh{^h) = + 9h{^h) + (v_f/, X//), (4) 

where the vector vh G K"" is defined so that the fine and coarse level problems are first order coherent, that is, 
their first order optimality conditions coincide. In this paper we have a non-smooth objective function and assume 
the coarse model is smooth 0 with L//-Lipschitz continuous gradient. Furthermore, we construct vh such that the 
gradient of the coarse model is equal to the gradient of the smoothed fine model’s gradient: 

VFff(Rx) = RVF^(x), (5) 

where F'^(x) is a /r-smooth approximation of F(x) with parameters (a, /3, K). Note that for the composite problem 
m F'^(x) = /(x) -|-g^(x), where is a /i-smooth approximation of ^(x). In our experiments for (El we use 
as given in (0. The next lemma gives a choice for vh, such that ([S]) is satisfied. 

Lemma 5 (Lemma 3.1 of |46j 1. Let Fh be a Lipsehitz continuous function with Lh Lipschitz constant, then for 

VH = RVF^(x) - (V/ff (Rx) + Vf/ff (Rx)) ( 6 ) 


we have 

VFff(Rx) = RVF^(x). 

The condition ([5l) is referred to as first order coherence. It ensures that if x is optimal for the smoothed fine 
level problem, then Rx is also optimal in the coarse level. 

While in practice it is often beneficial to use more than one levels, for the theoretical analysis of our algorithm 
without loss of generality we assume that there is only one coarse level. Indeed, assume that G is a 

linear operator that transfers x from R^ to R^. Now, we can define our restriction operator as R = 0^=1 
where hi = n, hi+i = Hi and Hi = nn- Accordingly, the prolongation operator will be P = rii=;^ where 
Note that this construction satisfies the assumption that crP = R^, with a = Hti o'*- 

3.2 MAGMA 

In this subsection we describe our proposed multi-level accelerated proximal algorithm for solving ([T}. We call 
it MAGMA for Multi-level Accelerated Gradient Mirror descent Algorithm. As in [35] and [I], at each iteration 
MAGMA performs both gradient and mirror descent steps, then uses a convex combination of their results to 
compute the next iterate. The crucial difference here is that whenever a coarse iteration is performed, our algorithm 
uses the coarse direction 

yfc+i = Xfc -I- Sfcdfc(xfc), 

instead of the gradient, where dfc(-) is the search direction and Sk is an appropriately chosen step-size. Next we 
describe how each of these terms is constructed. 

At each iteration k the algorithm first decides whether to use the gradient or the coarse model to construct a 
search direction for the gradient descent step. This decision is based on the optimality condition at the current 
iterate x^: we do not want to use the coarse direction when it is not helpful, i.e. when 

"^This can be done by smoothing the non-smooth coarse term. 



• the first order optimality conditions are almost satisfied, or 

• the current point is very close to the point x, where a coarse iteration was last performed, as long as the 
algorithm has not performed too many gradient correction steps. 

More formally, we choose the coarse direction, whenever both of the following conditions are satisfied: 

l|R-V^;xfc(xfc)|U > «:||VF^,(xfc)||*, 

||xfc - x|| > i?||x|| or q<Kd, 

where k £ (0, min (1, min ||R||)), d > 0 and are predefined constants, and q is the number of consecutive gradient 
correction steps m, na¬ 
if a coarse direction is chosen, a coarse model is constructed as described in (|1|). In order to satisfy the coherence 
property ((S]), we start the coarse iterations with x^f^o = Rxfc, then solve the coarse model by a first order method 
and construct a search direction: 

dfc(x/c) = F{x.h,Nh — Rxfc), (8) 

where ^h,Nh ^ is an approximate solution of the coarse model after Nh iterations. Note that in practice we 
do not find the exact solution, but rather run the algorithm for Nh iterations, until we achieve a certain acceptable 
^h,Nh ■ In onr experiments we used the monotone version of FISTA [3], however in theory any monotone algorithm 
will ensure convergence. 

We next show that any coarse direction defined in ([5]) is a descent direction for the smoothed fine problem. 

Lemma 6 (Descent Direction). If at iteration k a coarse step is performed and suppose that Fh{'x.h,Nh) < 
Fh{^h,i), then for any x^ it holds that 


(dfc(xfe),VF^,(xfe)) < -^||VF^,(xfc )||2 < 0 . 

Proof. Using the definition of coarse direction d(xfc), linearity of R and Lemma Owe obtain 

(dfe(xfc), VF^,,(xfc)) = {TC{-Xh,Nh - VF^,(xfc)) 

= {^H,Nh - ^H, 0, (Xfc)) 

= {'^H.Nh — Xff,o, VFij(xjy_o)) 

< Fh{^H.Nh) — Fh{x.H,o), 

where for the last inequality we used the convexity of Fh- On the other hand using the monotonicity assumption 
on the coarse level algorithm and lemma (HD we derive: 

Fh{^h,Nh) — Fh{^h.o) < Fh{xh,i) — Fh{xh,o) < —^ 77 — IIVF //( x //_ o )||*- ( 10 ) 

ZLh 

Now using the choice xh,o = Rx^ and we obtain: 

VF//(x//,o) = VFf/(Rxfc) = RVF^j^(xfc). (11) 

Then combining (ITUl) . (fTTl) and condition (0 we obtain 

Ff/(xi/^WH) - Ff/(xi/,o) < -^^l|VRF^Jxfc)||^ < -^^||VF^Jxfc)||^ (12) 

Finally, combining (jOD and (1121) we derive the desired result. □ 

After constructing a descent search direction, we find suitable step sizes for both gradient and mirror steps at 
each iteration. In [1] a fixed step size of au+i = 7 ^ is used for gradient steps. However, in our algorithm we do not 
always use the steepest descent direction for gradient steps, and therefore another step size strategy is required. We 
show that step sizes obtained from Armijo-type backtracking line search on the smoothed (fine) problem ensures 
that the algorithm converges with optimal rate. Starting from a predefined Sk = s*^, we reduce it by a factor of 
constant r € ( 0 , 1 ) until 

F^fc(yfc-Hi) < F^,(xfe)-hcsfe(dfe(xfc),VF^,(xfc)) (13) 
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is satisfied for some predefined constant c € (0,1). 

For mirror steps we always use the current gradient as search direction, as they have significant contribution 
to the convergence only when the gradient is small enough, meaning we are close to the minimizer, and in those 
cases we do not perform coarse iterations. However, we need to adjust the step size ak for mirror steps in order to 
ensure convergence. Next we formally present the proposed algorithm. 

Data: MAGMA(/(.), g{-), xq, T) 

Set yo = zq = Xq and ao = 0 
for fc = 0,1,2,..., T do 

Choose rjk+i and Ofc+i according to m and (l2^ . 

Set tk = - - -. 

ak+irik+i 

Set Xfc = tfcZfc + (1 - tk)yk- 
if Conditions ^ are satisfied then 
Perform Njj coarse iterations. 

Set dfe = P{:x.h,Nh — R'X:^). 

Set yfc+i = Xfc + Sfcd(xfc), with Sk satisfying (US]), 
end 
else 

I Set yfc+i = prox(xfe). 

end 

Set Zfc+i = MirrzjV/(xfc),afc+i). 

end 

Algorithm 3: MAGMA 

Now we show the convergence of Algorithm [3] with OiXj^/e) rate. First we prove an analogue of Lemma[3l 
Lemma 7. For every u € R” and rjk > 0 and 


Vk+l = 


f Lf, 

when k + 1 is 

y max < 

r 1 Lh \ 

4a^i7fc ’ cskK^ J 


otherwise 


it holds that 


Q:fe+i(V/(xfe), Zfc - u) + afe+i(g(zfc) - g{u)) 

<al+iLf Prog(xfe) + (u) - (u) 

<ak+ilgk+i(F^k(^k) - F^^(yk+i)) + Lffi^k] + V^^{u) - K,+i(u) 

Proof. First note that the proof of the first inequality follows directly from Lemma 4.2 of [1]. Now assume k 
is a coarse step, then the first inequality follows from Lemma [21 To show the second one we hrst note that for 
X = prox(xfc), 

Prog(xfc)-^||VF^,(xfc)f 
= - ^I|x-Xfe|p - (V/(xfe),X-Xfc) - 
< - ^l|x- Xfcf - (V/(xfe),x-Xfc) - 
= - ^(11* - + 2(ic - Xfc, 

</3gk- 

Here we used the definitions of Prog, prox and as well as the convexity of g^. Therefore from Lemma [S] and 
backtracking condition (1131) we obtain that if fc is a coarse correction step, then 

Lf Prog(xfc) < i||VF^Jxfc)f+ Lj^^fc 

Lh 

< -^(dfc(xfc), (xfc)) + Lffink 

Ki 

< - PnAyk+i)) + Lffigk; 
cs ^ 


g(x) +g(xfe) - ^||VF/,Jxfc)f 


(x - Xfc, y/g,,^ (xfe)) + /3/ifc - 


1 


|VF^,(xfc)|p 


(xfc)) + ||-^VF^,(xfc)f) +/3^fc 
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Vk+l = 


Otherwise, if fc is a gradient correction step, then from Lemma [T] 

L/Prog(xfc) < L/(i^(xfc) -F(yfe+i)) < Lf{F^^{xk) - (yfe+i)) + L//3/rfc. 

Now choosing 

Lf , when fc + 1 is a gradient correction step 
max \ f, otherwise 

f 4 a^? 7 fc ’ cSfcK^ j ’ 

we obtain the desired result. □ 

Remark 3. The recurrent choice ofrik+i may seem strange at this point, as we could simply set it to max{ 
however forcing r/k+i > 1 — 5 — helps us ensure that tk £ ( 0 , 1 ] later. 

Lemma 8 (Coupling). For any u £ R" and tk = 1/ak+iPk+i, where pk+i is defined as in Lemma^ it holds that 

afc+i%+iF(yfc+i) - {al^j^T]k+i - afc+i)F(yfc) + (K,+i(u) - K,(u)) 

< afe+iF(u) + [Lf + r]k+i)al+il3Tk- (14) 

Proof. First note that, if fc is a gradient correction step, then the proof follows from Lemma ID Now assume fc is a 
coarse correction step, then using the convexity of / we obtain 

F(xfc) - F(u) = /(xfc) - /(u) + 5 (xfc) - g{u) 

< (V/(Xfc),Xfc - U) +5(xfc) -5(u). 

On the other hand, from the definition of 5 ^^ and its convexity we have 

g(xfe) - g(zfc) < (xfc) + I3ip.k - (z^) + /32/ifc 

< (V5/xfc(xfc)iXfc - Zfc) +/3^fc. 

Then using (fTO . we can rewrite g(xfc) — g{u) as 

g(xfc) - g(u) = (g(xfc) - g{zk)) + (sK) “ 5 (u)) 

< (V 5 Mfc(xfc),Xfc - Zfc) +I3gk + {g{zk) - g{u)). 

Now rewriting (V/(xfc),Xfc — u) = (V/(xfe),Xfe — Zk) + (V/(xfc),Zfc — u) in (ITSl) and using (ITT)) we obtain 


(15) 


(16) 


(17) 


F(xfe) - F(u) < (VF^JxA,),Xfc - Zfc) + Pfik + (V/(xfc),Zfc - u) + {g{zk) - ff(u)), 

where we used that VF^;^(xfc) = V/(xfc) + Vg^j,(xfc). 

From the choice tfc(xfc — Zfc) = (1 — tk){yk — Xfc) and convexity of F^^. we have 


(18) 


(VF^Jxfc),Xfc - Zfc) = 


< 


< 


(1-tfc) 

tk 

(1-4) 

tk 

(1-4) 

tk 


{'^Fij.ki^k),yk - Xfc) 

(yfc )--?)*»= (xfc)) 

(F(yfc) - F(xfc)) + -^/3/rfc, 

Cfc 


where for the last inequality we used that F^j, is a/Xfc-smooth approximation of F. Then choosing tfc = l/(afc-)-i? 7 fc+i), 
where ctfc+i is a step-size for the Mirror Descent step and ? 7 fc+i is defined in Lemma [71 we have 

(VF^^ (xfc), Xfc - Zfc) < (afc+i77fc+i - l)(F(yfc) - F(xfc)) -h (afc.+i77fc-+i - l)/3/rfc. (19) 

Plugging (IT^ into Cl we obtain 

afc+i(F(xfc) - F(u)) < (a^+i? 7 fc+i - afc+i)(F(yfc) - F(xfc)) -h al_^_lgk+l|3^J.k 

-hafc.+i(V/(xfc),Zfc - u) + ak+iig{zk) - g{u)). (20) 

Finally, we apply Lemma [7] on (1201) and obtain the desired result after simplifications: 

afc+i(F(xfc) - F(u)) < {al^^pk+i - afc+i)(F(yfc) - F(xfc)) 

+ a^+i[r?fc-+i(F(xfc) - F(yfc+i)) -1- (F/ -h gk+i)l3tik] + 14Ju) - 14fc+i(u). 


□ 
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Theorem 2 (Convergence). After T iterations of Algorithm without loss of generality assuming that the last 
iteration is a gradient correction step for 


Vk+l = 


Lf, when k + 1 is a gradient correction step 

max < i : otherwise 

4 aJ? 7 fc ’ csfcK'^ J ’ 


( 21 ) 


and 


it holds that 


OLk+l = 



when k + \ is a gradient correction step 
—, otherwise 


Fiyr) - F{^*) < O 


rj^2 


( 22 ) 


Proof. By choosing rjk+i and ak+i according to (I^Tl) and (1^^ . we ensure that alpk = al_^_i'r]k+i — cifc+i + 
and tk = 1 /(ak+itik+i) S (0,1]. Then telescoping Lemma[5] with fc = 0,1,..., T — 1 we obtain 


T-l 


^[al+iVk+iFijk+i) - aligkF{Yk) + + Kt(u) - Ko(u) 


fc =0 


T-l 


< + {Lf + r]k+i)al+iPlik\- 


k—0 


Or equivalently, 


T-l 


arVTFiyr) - ao^o-F(yo) + T—^iyk) + V^^{u) - Czo(u) 

k^o 


T-l 


< [<^k+iF{u) + {Lf + r]k+i)al+iPfj.k] , 


k=0 

Then choosing u = x* and noticing that F{yk) > F{x*) and 14 t(u) > 0, for an upper bound 0 > 14o(x*)) we 
obtain 

T-l.. ^ 


(XTVTFiyr) - alr]oF{yo) < 0 + 


k=0 


Oik+1 


4r?fc+i 


F(x*) + {Lf + r]k+i)ak+iPLk 


Now using the fact that ak+i — = Q^fc+ii?fc+i ~ (^’k'Hk, for A: = 0,1,..., T — 1 we simplify to 

T-l 

aTVTFiyr) - alr]oF{yo) < 0 + a^PTFi^*) - a:lr]oF{x*) + ^^(L/ + r]k+i)al^-yPp.k. 

/i-O 

Then defining ao = 0 r]o = Lf and using the property of Bregman distances we obtain 

T-l 

arVT{F{yT) - F{x*)) < 0 + ao7?o(-F(yo) - F{x*)) + '^{Lf+ r]k+l)al_^_l|3^lk 




T-l 


< 0 + ^ + Vk+i)al^ifitJ.k- 


k=0 


Then assuming that the last iteration T is a gradient correction ste]:H and using the definitions of ax and rjT 
for the left hand side, we can further simplify to 


{T + lf 
iLf 


T-l 


{Fiyr) - F{x*)) < 0 + (T/ + r]k+i)al+iPiik, 


k=0 


^This assumption does not lose the generality, as we can always perform one extra gradient correction iteration 
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or equivalently 


Q + J2kJiLf + Vk+i)al+il3f^k 
(r + i)2 


F{yT)-F{x*)<4Lf 


Finally, choosing fik S ^0, ^ for a small predefined ( S (0,1], we obtain; 


Fiyr) - F(x*) < 


4 ^/(e + C) 

(T+l )2 ■ 


□ 


Remark 4. Clearly, the constant factor of our Algorithm’s worst case convergence rate is not better than that of 
ACM, however, as we show in the next section, in practice MAGMA can be several times faster than ACM. 


4 Numerical Experiments 

In this section we demonstrate the empirical performance of our algorithm compared to FISTA and a variant of 
MISTA [?T] on several large-scale face recognition problems. We chose those two particular algorithms, as MISTA is 
the only multi-level algorithm that can solve non-smooth composite problems and FISTA is the only other algorithm 
that was able to solve our large-scale problems in reasonable times. The source code and data we used in the 
numerical experiments is available from the webpage of the second author, www.doc.ic.ac.uk/^pp500/magma.html 
In Section 1131 we also compare FISTA and MAGMA to two recent proximal stochastic algorithms: (APCG [ST]) 
and SVRG [50]. We chose Face Recognition (FR) as a demonstrative application since (a) FR using large scale 
dictionaries is a relatively unexplored problem and (b) the performance of large scale face recognition depends 
on the face resolution. 

4.1 Robust Face Recognition 

In [47] it was shown that a sparse signal x G K." from highly corrupted linear measurements b = Ax -f e G R", 
where e is an unknown error and A G jg ^ highly correlated dictionary, can be recovered by solving the 

following £i-minimization problem 


min ||x||i-I-||e||i subject to Ax + e = b. 

xGR",eeR'" 

It was shown that accurate recovery of sparse signals is possible and computationally feasible even for images with 
nearly 100% of the observations corrupted. 

We introduce the new variable w = [x^, G and matrix B = [A, I] G where I G is 

the identity matrix. The updated (El problem can be written as the following optimization problem: 

min i||Bw-b||^-hA||w||i. (23) 

wGR"*+" 2 

A popular application of dense error correction in computer vision, is the face recognition problem. There A 
is a dictionary of facial images stacked as column vector^j, b is an incoming image of a person who we aim to 
identify in the dictionary and the non-zero entries of the sparse solution x* (obtained from the solution of (12311 
w* = [x*^,e*^]^) indicate the images that belong to the same person as b. As stated above, b can be highly 
corrupted, e.g. with noise and/or occlusion. 

In a recent survey Yang et. al. |51j compared a number of algorithms solving the face recognition problem 
with regard to their efficiency and recognition rates. Their experiments showed that the best algorithms were the 
Dual Proximal Augmented Lagrangian Method (DALM), primal dual interior point method (PDIPA) and LILS 
[29] . however, the images in their experiments were of relatively small dimension - 40 x 30 pixels. Gonsequently the 
problems were solved within a few seconds. It is important to notice that both DALM and PDIPA are unable to 
solve large-scale (both large number of images in A and large image dimensions) problems, as the former requires 
inverting AA^ and the later uses Newton update steps. On the other hand LILS is designed to take advantage of 
sparse problems structures, however it performs significantly worse whenever the problem is not very sparse. 

®FR using large scale dictionaries is an unexplored problem in £i optimization literature due to the complexity of the E problem. 

^Facial images are, indeed, highly correlated. 
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4.2 MAGMA for Solving Robust Face Recognition 

In order to be able to use the multi-level algorithm, we need to define a coarse model and restriction and prolongation 
operators appropriate for the given problem. In this subsection we describe those constructions for the face 
recognition problem used in our numerical experiments. 

Creating a coarse model requires decreasing the size of the original fine level problem. In our experiments we only 
reduce n, as we are trying to improve over the performance of first order methods, whose complexity per iteration 
is 0{n^). Also the columns of A are highly correlated, which means that reducing n loses little information about 
the original problem, whereas it is not necessarily true for the rows of A. Therefore, we introduce the dictionary 
Ah € with uh < n. More precisely, we set Ah = ARj, with Rj, defined in (I25p . With the given 

restriction operator we are ready to construct a coarse model for the problem (|23p : 

min -\\[Ah I] w//- b||^ 4-A ^ \/f^H+^Hj + (24) 

J =1 

where wh = [x^, and vh is defined in ([6]). It is easy to check that the coarse objective function and gradient 
can be evaluated using the following equations: 

Fniv^n) =-^WAnyiH+ e-h\\l +X ^ ,w//). 


VFh{wh) = 


Ah 





X// 


[Ajb] 


e 


b 




where V(?//(w//) is the gradient of gn defined in ([3|) and is given elementwise as follows: 

Xwh.j 




= Aj = 1, 2,.. .,nH- 




Hence we do not multiply matrices of size m x (m + nn), but only m x uh- 
We use a standard full weighting operator m as a restriction operator, 


R 


X 


2 1 0 0 0 0 

0 12 10 0 
0 0 0 1 2 1 

0 ... 0 0 1 


... 0 

... 0 

... 0 

0 

2 1 


(25) 


and its transpose for the prolongation operator at each level i G {1, ...,F[ — 1}, where ni = n, n-i+i = and FI is 
the depth of each V-cycle. However, we do not apply those operators on the whole variable w of the model (I23L 
as this would imply also reducing m. We rather apply the operators only on part x of w = [x^,e^]^, therefore 
our restriction and prolongation operators are of the following forms: R = [Ra;,!] and P = R^ = [Rj,I]^. 

In all our experiments, for the Mirr operator we chose the standard Euclidean norm || • ||2 and accordingly i|| • II 2 
as a Bregman divergence. 


4.3 Numerical Results 

The proposed algorithm has been implemented in MatLab@ and tested on a PC with Intel© Core"’’^ i7 CPU 
(3.4GHzx8) and 15.6GB memory. 

In order to create a large scale FR setting we created a dictionary A of up to 8, 824 images from more than 
4, 000 different peopled by merging images from several facial databases captured in controlled and in uncontrolled 
conditions. In particular, we used images from MultiPIE [27], XM2VTS [32] and many in-the-wild databases such 
as LEW [28], LFPW [8] and HELEN [30]. In [51] the dictionary was small hence only very low-resolution images 
were considered. However, in large scale FR having higher resolution images is very beneficial (e.g., from a random 
cohort of 300 selected images, low resolution images of 40 x 30 achieved around 85% recognition rate, while using 

®Some people had up to 5 facial images in the database some only one. 
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440 images 

944 images 

8824 images 

MISTA 

7 

2 

2 

MAGMA 

7 

7 

13 


Table 1: Number of coarse levels used for MISTA and MAGMA for each dictionary 



440 images 

944 images 

8824 images 

FISTA 

98.69 

175.77 

1753 

MISTA 

68.77 

70.4 

1302 

MAGMA 

25.73 

29.62 

481 


Table 2: Average CPU running times (seconds) of MISTA, FISTA and MAGMA 


images of 200 x 200 we went up to 100%). Hence, in the remaining of our experiments the dictionary used is [A l] 
of 40,000 X 48, 824 and some subsets of this dictionary. 

In order to show the improvements of our method with regards to FISTA as well as to MISTA, we have designed 
the following experimental settings 

• Using 440 facial images, then [A l] is of 40,000 x 40,440 

• Using 944 facial images, then [A l] is of 40, 000 x 40, 944 

• Using 8,824 facial images, then [A l] is of 40,000 x 48,824 

For each dictionary we randomly chose 20 different input images b that are not in the dictionary and ran the 
algorithms with 4 different random starting points. For all of those experiments we set the parameters of the 
algorithms as follows: k = 0.9, 0.8 and 0.7 respectively for experiment settings with 440, 944 and 8 , 824 images 
in the database, Kd = 30 for the two smaller settings and Kd = 50 for the largest one, the convergence tolerance 
was set to e = 10“® for the two smaller experiments and e = 10~^ for the largest one, A = 10“®, r = 0.95 and 

= 10. For larger dictionaries we slightly adjust k and Kd in order to adjnst for the larger problem size, giving 
the algorithm more freedom to perform coarse iterations. In all experiments we solve the coarse models with lower 
tolerance than the original problem, namely with tolerance 10 “^. 

For all experiments we nsed as small coarse models as possible so that the corresponding algorithm could 
produce sparse solutions correctly identifying the sought faces in the dictionaries. Specifically, for experiments on 
the largest database with 8824 images we nsed 13 levels for MAGMA (so that in the coarse model Ah has only 
2 columns) and only two levels for MISTA, since with more than two levels MISTA was unable to produce sparse 
solutions. The number of levels used in MAGMA and MISTA are tabulated in Table [TJ 

In all experiments we measure and compare the GPU running times of each tested algorithm, because comparing 
the number of iterations of a multi-level method against a standard first order algorithm would not be fair. This 
is justified as the multilevel algorithm may need fewer iterations on the fine level, but with a higher computational 
cost for obtaining each coarse direction. Comparing all iterations including the ones in coarser levels would not be 
fair either, as those are noticeably cheaper than the ones on the fine level. Even comparing the nnmber of objective 
function values and gradient evaluations would not be a good solution, as on the coarse level those are also 
noticeably cheaper than on the fine level. Furthermore, the multilevel algorithm requires additional computations 
for constructing the coarse model, as well as for vector restrictions and prolongations. Therefore, we compare the 
algorithms with respect to CPU time. 

Figures [T^ [lb] and [T^ show the relative CPU running times until convergence of MAGMA and MISTA compared 
to FISTA]^. The horizontal axis indicate different inpnt images b, whereas the vertical lines on the plots show the 
respective highest and lowest values obtained from different starting points. Furthermore, in Table|2|we present the 
average CPU times required by each algorithm until convergence for each problem setting, namely with databases 
of 440, 944 and 8824 images. As the experimental resnlts show, MAGMA is 2 — 10 times faster than FISTA. On 
the other hand, MISTA is more problem dependant. On some instances it can be even faster than MAGMA, but 
on most cases its performance is somewhere in between FISTA and MAGMA. 

In order to better understand the experimental convergence speed of MAGMA, we measnred the error of 
stopping criteri£0 and objective function values generated by MAGMA and FISTA over time from an experiment 

®We only report the performance of FISTA, as AGM and FISTA demonstrate very similar performances on all experiments. 

^®We use the norm of the gradient mapping as a stopping criterion 
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(a) CPU times with 440 images 


(b) CPU times with 944 images (c) CPU times with 8824 images 


Figure 1: Comparing MAGMA and MISTA with FISTA. Relative CPU times of MAGMA and MISTA 
compared to FISTA on face recognition experiments with dictionaries of varying number of images (440, 944 and 
8824). For each dictionary 20 random input images b were chosen, which indicate the numbers on horizontal axis. 
Additionally, the results with 4 random starting points for each experiment are reflected as horizontal error bars. 




CPU time (sec); (log-scale) CPU time (sec); (log-scale) 

(a) Gradient mapping (b) Function value 

Figure 2: MAGMA and FISTA convergences over time. Comparisons of gradient mappings and function 
values of MAGMA, MISTA and FISTA over time. 


with 440 images in the dictionary. The stopping criteria are log-log-plotted in Figure Ea] and the objective function 
values - in Figure [2bl In all figures the horizontal axis is the GPU time in seconds (in log scale). Note that 
the coarse correction steps of MAGMA take place at the steep dropping points giving large decrease, but then 
at gradient correction steps first large increase and then relatively constant behaviour is recorded for both plots. 
Overall, MAGMA reduces both objective value and the norm of the gradient mapping significantly faster. 

4.4 Stochastic Algorithms for Robust Face Recognition 

Since (randomized) coordinate descent (GD) and stochastic gradient (SG) algorithms are considered as state of the 
art for general l\ regularized least squared problems, we finish this section with a discussion on the application 
of these methods to the robust face recognition problem. We implemented two recent algorithms: accelerated 
randomised proximal coordinate gradient method (APCG)E3 [SI] and proximal stochastic variance reduced gradient 
(SVRG) [50]. We also tried block coordinate methods with cyclic updates [7], however the randomised version of 
block coordinate method performs significantly better, hence we show only the performances of APGG and SVRG 
so as not to clutter the results. 

In our numerical experiments we found that CD and SG algorithms are not suitable for robust face recognition 
problems. There are two reasons for this. Firstly, the data matrix contains highly correlated data. The correlation 
is due to the fact that we are dealing with a fine-grained object recognition problem. That is, the samples of the 
dictionary are all facial images of different people and the problem is to identify the particular individual. The 
second reason is the need to extend the standard li regularized least squared model so that it can handle gross 

implemented the efficient version of the algorithm that avoids full vector operations. 
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(a) 40, 000 X 100 data dimen- 

sinn 


(b) 40, 000 X 8, 000 data 

Fppnsinn 


di- (c) 40, 000 X 100 data dimen- 

sjnTT Hiirk-pf mnHpl 


(d) 40, 000 X 8,000 data di- 

Fppnsinn* Hiirlcpt mnHpl 



Time (seconds) 




n n 

’ \ I \i 


i^.h 

iE- 




Time (seconds) 


(e) 40,000 X 100 data dimen- (f) 40, 000x8, 000 data dimen- (g) 40,000 x 100 data dimen- (h) 40,000 x 8,000 data di- 
sion sion sion; bucket model mension; bucket model 


Figure 3: Comparing FISTA, APCG and SVRG after running them for a fixed amount of time with ((c), (d), (g) 
and (h)) and without ((a), (b), (e) and (f)) the bucket model on randomly generated data. The first row contains 
sequences of the objective values and the bottom row contains optimality criteria, i.e gradient mappings. 


errors, such as occlusions. It can be achieved by using the dense error correction formulation m in (l23ll (also 
referred to as the “bucket” model in [47]). 

To demonstrate our argument we implemented APCG and SVRG and compared them with FISTA and our 
method - MAGMA. We run all four algorithms for a fixed time and compare the achieved function values. For 
APGG we tried three different block size strategies, namely 1, 10 and 50. While 10 and 50 usually performed 
similarly, 1 was exceptionally slow, so we report the best achieved results for all experiments. For SVRG one has 
to choose a fixed step size as well as a number of inner iterations. We tuned both parameters to achieve the best 
possible results for each particular problem. 

We performed the experiments on 5 databases: the 3 reported in the previous subsection (with 440, 944 and 
8824 images of 200 x 200 dimension) and two “databases” with data generated uniformly at random with dimensions 
40,000 X100 and 40,000 x 8,000. The later experiments are an extreme case where full gradient methods (FISTA) are 
outperformed by both APGG and SVRG when on standard ^i-regularized least squared problems, while MAGMA 
is not applicable. The results are given in Figure |3| As we can see from Figures [3^ and |3b| all three methods can 
achieve low objective values very quickly for standard ii regularized least squared problems. However, after adding 
an identity matrix to the right of database (dense error correction model) the performance of all algorithms changes: 
they all become significantly slower due to larger problem dimensions iFigures [3cl and [3d)) . Most noticeably SVRG 
stagnates after first one-two iterations. For the smaller problem (Figure |3^ FISTA is the fastest to converge to a 
minimizer, but for the larger problem (Figure |3d| APGG performs best. The picture is quite similar when looking 
at optimality conditions instead of objective function values in Figures |3e] to I3hl all three algorithms, especially 
SVRG are slower on the dense error correction model and for the largest problem APGG performs best. 

This demonstrates that for £i-regularized least squares problems partial gradient methods with a random 
dictionary can often be faster than accelerated full gradient methods. However, for problems with highly correlated 
dictionaries and noisy data, such as the robust face recognition problem, the picture is quite different. 

Having established that APCG and SVRG are superior than FISTA for random data with no gross errors, we 
turn our attention to the problem at hand, i.e. the robust face recognition problem. First we run FISTA, APCG, 
SVRG and MAGMA on problems with no noise and thus the bucket model (l23l) is not used. We run all algorithms 
for 100 seconds on a problem with n = 440 images of m = 200 x 200 = 40, 000 dimension in the database. As 
the results in Figure H] show all algorithms achieve very similar solutions of good quality with APCG resulting in 
obviously less sparse solution (Figure [3^. 

Then we run all the tested algorithms on the same problem, but this time we simulated occlusion by adding a 
black square on the incoming image. Thus, we need to solve the dense error correction optimisation problem (1231) . 
The results are shown in Figure |5| The top row contains the original occluded image and reconstructed images by 
each algorithm and the middle row contains reconstructed noises as determined by each corresponding algorithm. 
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(f) FISTA (g) APCG (h) SVRG (i) MAGMA 

Figure 4: Result of FISTA, APCG, SVRG and MAGMA in an image decomposition example without occlusion. 
Top row ((a) to (e)) contains the original input image and reconstructed images from each algorithm. The bottom 
row ((f) to (i)) contains the solutions x* from each corresponding algorithm. 


FISTA (Figure [5b| and MAGMA (Figure [5e]) both correctly reconstruct the underlying image while putting noise 
(illumination changes and gross corruptions) into the error part (Figures [52 and [5i|. APCG (Figure |5c]), on the 
other hand, reconstructs the true image together with the noise into the error part (Figure [5^, as if the sought 
person does not have images in the database. SVRG seems to find the correct image in the database, however it 
fails to separate noise from the true image iFigures I5dl and IShl) . We also report the reconstruction vectors x* as 
return by each algorithm in the bottom row. FISTA (Figure and MAGMA (l5mD both are fairly sparse with a 
clear spark indicating to the correct image in the database. SVRG (Figure |5l| also has a minor spark indicating to 
the correct person in the database, however it is not sparse, since it could not separate the noise. The result from 
APGG (Figure HQ is the worst, since it is not sparse and indicates to multiple different images in the database, 
thus failing in the face recognition task. 

Note that this is not a specifically designed case, in fact, SVRG, APCG and other algorithms that use partial 
information per iteration might suffer from this problem. Indeed, APGG uses one or a few columns of A at a time, 
thus effectively transforming the original large database into many tiny ones, which cannot result in good face 
recognition. SVRG, on the other hand, uses one row of A and all of variable x at a time for all iterations, which 
allows it to solve the recognition task correctly. However, when the dense error correction optimisation problem in 
(1231) is used the minibatch approach results in using only one row of the identity matrix with each corresponding 
row of A and one coordinate from variable e together with x, therefore resulting in poor performance when there 
is noise present. 

To further investigate the convergence properties of the discussed algorithms, we plot the objective function 
values achieved by APGG, SVRG, FISTA and MAGMA on problems with 440, 944 and 8824 images in the database 
in Figure l6p^ . As we can see from Figures [6a1 and [6bl APGG is slower than FISTA for smaller problems and slightly 
outperforms it only for the largest dictionary (Figure |6Q. SVRG on the other hand quickly drops in function after 
one-two iterations, but then stagnates. MAGMA is significantly faster than all three algorithms in all experiments. 

However, looking at the objective function value alone can be deceiving. In order to obtain a full picture of the 
performance of the algorithms for the face recognition task, one has to look at the optimality conditions and the 
sparsity of the obtained solution. In order to test it, we run FISTA, APGG, SVRG and MAGMA until the first 
order condition is satisfied with e = 10“® error. As the plots in Figure |7| indicate APGG and SVRG are slower 
to achieve low convergence criteria. Furthermore, as Table [3] shows they also produce denser solutions with higher 
£i-norm. 

5 Conclusion and Discussion 

In this work we presented a novel accelerated multi-level algorithm - MAGMA, for solving convex composite 
problems. We showed that in theory our algorithm has an optimal convergence rate of 0(\/^)^ where e is the 
accuracy. To the best of our knowledge this is the first multi-level algorithm with optimal convergence rate. 

^^For all experiments we use the same standard parameters. 


18 

















(a) Original 
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(j) FISTA 
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Figure 5: Result of FISTA, APCG, SVRG and MAGMA in an image decomposition example with corruption. Top 
row ((a)-(e)) contains the original occluded image and reconstructed images from each algorithm. The middle row 
((f)-(i)) contains the reconstructed noise (e* reshaped as an image) as determined by corresponding algorithms. 
The bottom row ((j)-(m)) contains the solutions x* from each corresponding algorithm. 
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Figure 6: Gomparing the objective values of FISTA, APGG, SVRG and MAGMA after running them for a fixed 
amount of time on dictionaries of various sizes. 
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Figure 7: Comparing the first order optimality criteria of FISTA, SVRG, APCG and MAGMA after running them 
for a fixed amount of time on dictionaries of various sizes. 
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Method 

GPU Time (seconds) 

Number of iterations 

l|x*||i 

FISTA 

1690 

554 

3.98 

APGG 

680 

68,363 (block size 10) 

6.67 

SVRG 

> 10,000 

> 93 (epoch length 17,648) 

5.19 

MAGMA 

526 

144 fine and 120 coarse 

3.09 


Table 3: Running FISTA, APCG, SVRG and MAGMA on the largest dictionary (with 8824 images) until conver¬ 
gence with e = 10“® accuracy or maximum time of 10, 000 GPU seconds returning x* as a solution. 


Furthermore, we demonstrated on several large-scale face recognition problems that in practice MAGMA can be 
up to 10 times faster than the state of the art methods. 

The promising results shown here are encouraging, and justify the use of MAGMA in other applications. 
MAGMA can be used to solve composite problems with two non-smooth parts. Another approach for applying 
FISTA on a problem with two non-smooth parts was given in [dO] . This problem setting is particularly attractive, 
because of its numerous applications, including Robust PGA m- magma’s extension for solving computer vision 
applications of Robust PGA (such as image alignment [42]) is an interesting direction for future research. 

In terms of theory, we note that even though we prove that MAGMA has the same worst case asymptotic 
convergence rate as Nesterov’s accelerated methods, it would be interesting to see, what conditions (we expect it to 
be high correlation of the columns of A) imposed on the problem ensure that MAGMA has a better convergence 
rate or at least a strictly better constant factor. This could also suggest an automatic way for choosing parameters 
K and Kii optimally and a closed form solution for the step size s^. 
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